Abstract

In its classical form, a consistent replicated service requires all replicas to witness the same 
evolution of the service state. Assuming a message-passing environment with a majority of 
correct processes, the necessary and sufficient information about failures for implementing a 
general state machine replication scheme ensuring consistency is captured by the Ω failure
detector.

This paper shows that in such a message-passing environment, Ω is also the weakest failure
detector to implement an eventually consistent replicated service, where replicas are expected 
to agree on the evolution of the service state only after some (a priori unknown) time. 

In fact, we show that Ω is the weakest failure detector to implement eventual consistency in any
message-passing environment, i.e., under any assumption on when and where failures might occur.
Ensuring (strong) consistency in any environment requires, in addition to Ω, the quorum failure
detector Σ. Our paper thus captures, for the first time, an exact computational difference between
building a replicated state machine that ensures consistency and one that only ensures
eventual consistency. 


1 Introduction 

State machine replication |23i2>3 is the most studied technique to build a highly-available and
consistent distributed service. Roughly speaking, the idea consists in replicating the service, modeled
as a state machine, over several processes and ensuring that all replicas behave like one correct and 
available state machine, despite concurrent invocations of operations and failures of replicas. This 
is typically captured using the abstraction of a total order broadcast [3], where messages represent 
invocations of the service operations from clients to replicas (server processes). Assuming that the 
state machine is deterministic, delivering the invocations in the same total order ensures indeed 
that the replicas behave like a single state machine. Total order broadcast is, in turn, typically 
implemented by having the processes agree on which batch of messages to execute next, using the 
consensus abstraction [2T1 [3]. (The two abstractions, consensus and total order broadcast, were 
shown to be equivalent [3].) 

Replicas behaving like a single one is a property generally called consistency. The sole purpose 
of the abstractions underlying the state machine replication scheme, namely consensus and total 
order broadcast, is precisely to ensure this consistency, while providing at the same time availability,
namely that the replicated service does not stop responding. The inherent costs of these abstractions
are sometimes considered too high, both in terms of the necessary computability assumptions about 
the underlying system ns 121 m, and the number of communication steps needed to deliver an 
invocation [2H 22] • 

An appealing approach to circumvent these costs is to trade consistency for what is sometimes
called eventual consistency [23] [28]: namely to give up the requirement that the replicas always
look the same, and replace it with the requirement that they only look the same eventually, i.e.,
after a finite but not a priori bounded period of time. Basically, eventual consistency says that the 
replicas can diverge for some period, as long as this period is finite. 

Many systems claim to implement general state machines that ensure eventual consistency in 
message-passing systems, e.g., me]. But, to our knowledge, there has been no theoretical study 
of the exact assumptions on the information about failures underlying those implementations. This 
paper is the first to do so: using the formalism of failure detectors M, it addresses the question 
of the minimal information about failures needed to implement an eventually consistent replicated 
state machine. 

It has been shown in [2] that, in a message-passing environment with a majority of correct 
processes, the weakest failure detector to implement consensus (and, thus, total order broadcast [3])
is the eventual leader failure detector, denoted Ω. In short, Ω outputs, at every process, a leader
process so that, eventually, the same correct process is considered leader by all. Ω can thus be
viewed as the weakest failure detector to implement a generic replicated state machine ensuring 
consistency and availability in an environment with a majority of correct processes. 

We show in this paper that, maybe surprisingly, the weakest failure detector to implement an 
eventually consistent replicated service in this environment (in fact, in any environment) is still Ω.
We prove our result via an interesting generalization of the celebrated “CHT proof” by Chandra, 
Hadzilacos and Toueg [2]. In the CHT proof, every process periodically extracts the identifier of a 
process that is expected to be correct (the leader) from the valencies of an ever-growing collection
of locally simulated runs. We carefully adjust the notion of valency to apply this approach to 
the weaker abstraction of eventual consensus, which we show to be necessary and sufficient to 
implement eventual consistency. 

Our result becomes less surprising if we realize that a correct majority prevents the system 
from being partitioned, and we know that both consistency and availability cannot be achieved 
while tolerating partitions mmm- Therefore, in a system with a correct majority of processes, 
there is no gain in weakening consistency: (strong) consistency requires the same information
about failures as eventual consistency. In an arbitrary environment, however, i.e., under any assumptions
on when and where failures may occur, the weakest failure detector for consistency is known to be
Ω + Σ, where Σ [8] returns a set of processes (called a quorum) so that every two such quorums
intersect at any time and there is a time after which all returned quorums contain only correct
processes. We show in this paper that ensuring eventual consistency does not require Σ: only Ω
is needed, even if we do not assume a majority of correct processes. Therefore, Σ represents the
exact difference between consistency and eventual consistency. Our result thus theoretically backs 
up partition-tolerance mm as one of the main motivations behind the very notion of eventual 
consistency. 

We establish our results through the following steps: 

• We give precise definitions of the notions of eventual consensus and eventual total order 
broadcast. We show that the two abstractions are equivalent. These underlie the intuitive 
notion of eventual consistency implemented in many replicated services mm- 

• We show how to extend the celebrated CHT proof [2], initially establishing that Ω is necessary
for solving consensus, to the context of eventual consensus. Through this extension, we
indirectly highlight a hidden power of the technique proposed in [2] that somehow provides 
more than was used in the original CHT proof. 

• We present an algorithm that uses Ω to implement, in any message-passing environment, an
eventually consistent replicated service. The algorithm features three interesting properties:
(1) An invocation can be performed after the optimal number of two communication steps,
even if a majority of processes is not correct and even during periods when processes disagree
on the leader, i.e., partition periods;¹ (2) If Ω outputs the same leader at all processes from
the very beginning, then the algorithm implements total order broadcast and hence ensures
consistency; (3) Causal ordering is ensured even during periods where Ω outputs different
leaders at different processes. 

The rest of the paper is organized as follows. We present our system model and basic definitions 
in Section 2. In Section 3, we introduce abstractions for implementing eventual consistency: namely,
eventual consensus and eventual total order broadcast, and we prove them to be equivalent. We
show in Section 4 that the weakest failure detector for eventual consensus in any message-passing
environment is Ω. We present in Section 5 our algorithm that implements eventual total order
broadcast using Ω in any environment. Section 6 discusses related work, and Section 7 concludes
the paper. In the optional appendix, we present some proofs omitted from the main paper, discuss 
an alternative (seemingly relaxed but equivalent) definition of eventual consensus, and recall basic 
steps of the CHT proof. 

2 Preliminaries 

We adopt the classical model of distributed systems provided with the failure detector abstraction 
proposed in 13 In particular we employ the simplified version of the model proposed in mm 

We consider a message-passing system with a set of processes Π = {p1, p2, …, pn} (n > 2).
Processes execute steps of computation asynchronously, i.e., there is no bound on the delay between 
steps. However, we assume a discrete global clock to which the processes do not have access. The 
range of this clock’s ticks is N. Each pair of processes is connected by a reliable link.

Processes may fail by crashing. A failure pattern is a function F : N → 2^Π, where F(t) is the
set of processes that have crashed by time t. We assume that processes never recover from crashes,
i.e., F(t) ⊆ F(t + 1). Let faulty(F) = ∪_{t∈N} F(t) be the set of faulty processes in a failure pattern
F, and correct(F) = Π − faulty(F) be the set of correct processes in F. An environment, denoted
ℰ, is a set of failure patterns.
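
The definitions above can be made concrete with a small sketch. It is only an illustration under stated assumptions: time is truncated to a finite horizon (the paper takes the union over all of N), and the names crash_pattern, faulty, correct and is_monotone are hypothetical helpers, not part of the paper.

```python
from typing import Callable, Dict, FrozenSet, Iterable

Process = str
# A failure pattern F maps a time t to the set of processes crashed by time t.
FailurePattern = Callable[[int], FrozenSet[Process]]


def crash_pattern(crash_times: Dict[Process, int]) -> FailurePattern:
    """Build F from a map process -> crash time; absent processes never crash."""
    def F(t: int) -> FrozenSet[Process]:
        return frozenset(p for p, tc in crash_times.items() if tc <= t)
    return F


def faulty(F: FailurePattern, horizon: int) -> FrozenSet[Process]:
    """Union of F(t) for t in 0..horizon (finite stand-in for the union over N)."""
    return frozenset().union(*(F(t) for t in range(horizon + 1)))


def correct(processes: Iterable[Process], F: FailurePattern, horizon: int) -> FrozenSet[Process]:
    """Processes that have not crashed by the horizon."""
    return frozenset(processes) - faulty(F, horizon)


def is_monotone(F: FailurePattern, horizon: int) -> bool:
    """Check F(t) is a subset of F(t + 1), i.e., crashed processes never recover."""
    return all(F(t) <= F(t + 1) for t in range(horizon))
```

Because crashes are permanent, F(t) only grows, so on a finite horizon faulty(F) coincides with F(horizon).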

A failure detector history H with range ℛ is a function H : Π × N → ℛ, where H(p, t) is
interpreted as the value output by the failure detector module of process p at time t. A failure
detector 𝒟 with range ℛ is a function that maps every failure pattern F to a nonempty set of
failure detector histories. 𝒟(F) denotes the set of all possible failure detector histories that may
be output by 𝒟 in a failure pattern F.

For example, at each process, the leader failure detector Ω outputs the id of a process; furthermore,
if a correct process exists, then there is a time after which Ω outputs the id of the same
correct process at every correct process. Another example is the quorum failure detector Σ, which
outputs a set of processes at each process. Any two sets output at any times and by any processes 
intersect, and eventually every set output at any correct process consists of only correct processes. 
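
As a rough illustration of these two examples, the following sketch checks the leader and quorum properties on a finite, recorded history. A finite trace can only give evidence, never a proof, of an "eventually" property; the function names and the dictionary encoding of histories are assumptions made for this example.

```python
def omega_stabilizes(history, correct, horizon):
    """history maps (process, time) -> leader id.

    Returns the earliest time from which every correct process outputs the
    same correct leader up to the horizon, or None if there is no such time.
    """
    for start in range(horizon + 1):
        outputs = {history[(p, t)] for p in correct for t in range(start, horizon + 1)}
        if len(outputs) == 1 and outputs <= set(correct):
            return start
    return None


def quorums_intersect(history):
    """history maps (process, time) -> frozenset of processes (a quorum).

    Checks that any two quorums, output at any times by any processes, intersect.
    """
    quorums = list(history.values())
    return all(q1 & q2 for q1 in quorums for q2 in quorums)
```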

¹ Note that three communication steps are, in the worst case, necessary when strong consistency is required [22].
An algorithm A is modeled as a collection of n deterministic automata, where A(p) specifies the
behavior of process p. Computation proceeds in steps of these automata. In each step, identified as
a tuple (p, m, d, A), a process p atomically (1) receives a single message m (that can be the empty
message λ) or accepts an input (from the external world), (2) queries its local failure detector
module and receives a value d, (3) changes its state according to A(p), and (4) sends a message
specified by A(p) for the new state to every process or produces an output (to the external world).
Note that the use of λ ensures that a step of a process is always enabled, even if no message is sent
to it. 

A configuration of an algorithm A specifies the local state of each process and the set of messages 
in transit. In the initial configuration of A, no message is in transit and each process p is in the 
initial state of the automaton A(p). A schedule S of A is a finite or infinite sequence of steps of A
that respects A(p) for each p. 

Following mi, we model inputs and outputs of processes using input histories HI and output
histories HO that specify the inputs each process receives from its application and the outputs
each process returns to the application over time. A run of algorithm A using failure detector 𝒟 in
environment ℰ is a tuple R = (F, H, HI, HO, S, T), where F is a failure pattern in ℰ, H is a failure
detector history in 𝒟(F), HI and HO are input and output histories of A, S is a schedule of A, and
T is a list of increasing times in N, where T[i] is the time when step S[i] is taken. H ∈ 𝒟(F), the
failure detector values received by steps in S are consistent with H, and HI and HO are consistent
with S. An infinite run of A is admissible if (1) every correct process takes an infinite number of
steps in S; and (2) each message sent to a correct process is eventually received.

We then define a distributed-computing problem, such as consensus or total order broadcast, as
a set of tuples (HI, HO) where HI is an input history and HO is an output history. An algorithm A
using a failure detector 𝒟 solves a problem P in an environment ℰ if in every admissible run of A
in ℰ, the input and output histories are in P. Typically, inputs and outputs represent invocations
and responses of operations exported by the implemented abstraction. If there is an algorithm that
solves P using 𝒟, we sometimes, with a slight abuse of language, say that 𝒟 implements P.

Consider two problems P and P'. A transformation from P to P' in an environment ℰ m is
a map T_{P→P'} that, given any algorithm A_P solving P in ℰ, yields an algorithm A_P' solving P' in
ℰ. The transformation is asynchronous in the sense that A_P is used as a “black box” where A_P'
is obtained by feeding inputs to A_P and using the returned outputs to solve P'. Hence, if P is
solvable in ℰ using a failure detector 𝒟, the existence of a transformation T_{P→P'} in ℰ establishes
that P' is also solvable in ℰ using 𝒟. If, additionally, there exists a transformation from P' to P
in ℰ, we say that P and P' are equivalent in ℰ.

Failure detectors can be partially ordered based on their “power”: failure detector 𝒟 is weaker
than failure detector 𝒟' in ℰ if there is an algorithm that emulates the output of 𝒟 using 𝒟' in
ℰ 12 [Hj. If 𝒟 is weaker than 𝒟', any problem that can be solved with 𝒟 can also be solved with
𝒟'. For a problem P, 𝒟* is the weakest failure detector to solve P in ℰ if (a) there is an algorithm
that uses 𝒟* to solve P in ℰ, and (b) 𝒟* is weaker than any failure detector 𝒟 that can be used to
solve P in ℰ.

3 Abstractions for Eventual Consistency 

We define two basic abstractions that capture the notion of eventual consistency: eventual total 
order broadcast and eventual consensus. We show that the two abstractions are equivalent: each 
of them can be used to implement the other. 
Eventual Total Order Broadcast (ETOB) The total order broadcast (TOB) abstraction [16]
exports one operation broadcastTOB(m) and maintains, at every process pi, an output variable di. 
Let di(t) denote the value of di at time t. Intuitively, di(t) is the sequence of messages pi delivered 
by time t. We write m ∈ di(t) if m appears in di(t).

A process pi broadcasts a message m at time t by a call to broadcastTOB(m). We say that a 
process pi stably delivers a message m at time t if pi appends m to di(t) and m is never removed
from di after that, i.e., m ∉ di(t − 1) and ∀t' ≥ t: m ∈ di(t'). Note that if a message is delivered
but not stably delivered by pi at time t, it appears in di(t) but not in di(t') for some t' > t.
Assuming that broadcast messages are distinct, the TOB abstraction satisfies: 

TOB-Validity If a correct process pi broadcasts a message m at time t, then pi eventually stably 
delivers m, i.e., ∀t'' ≥ t': m ∈ di(t'') for some t' > t.

TOB-No-creation If m ∈ di(t), then m was broadcast by some process pj at some time t' < t.

TOB-No-duplication No message appears more than once in di(t). 

TOB-Agreement If a message m is stably delivered by some correct process pi at time t, then m 
is eventually stably delivered by every correct process pj. 

TOB-Stability For any correct process pi, di(t1) is a prefix of di(t2) for all t1, t2 ∈ N, t1 ≤ t2.

TOB-Total-order Let pi and pj be any two correct processes such that two messages m1 and m2
appear in di(t) and dj(t) at time t. If m1 appears before m2 in di(t), then m1 appears before
m2 in dj(t).

We then introduce the eventual total order broadcast (ETOB) abstraction, which maintains the 
same inputs and outputs as TOB (messages are broadcast by a call to broadcastETOB(m)) and 
satisfies, in every admissible run, the TOB-Validity, TOB-No-creation, TOB-No-duplication, and 
TOB-Agreement properties, plus the following relaxed properties for some τ ∈ N:

ETOB-Stability For any correct process pi, di(t1) is a prefix of di(t2) for all t1, t2 ∈ N, τ ≤ t1 ≤ t2.

ETOB-Total-order Let pi and pj be correct processes such that messages m1 and m2 appear in
di(t) and dj(t) for some t ≥ τ. If m1 appears before m2 in di(t), then m1 appears before m2
in dj(t).
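
To make the role of the stabilization time concrete, here is a minimal sketch that searches a finite trace for the earliest time from which both relaxed properties hold. It assumes each output variable is recorded as a list of sequences indexed by time and that TOB-No-duplication holds; the function names are hypothetical, and a finite trace only provides evidence for the properties, not a proof.

```python
def stability_holds_from(d, tau):
    """d[t] is the sequence d_i(t). Prefix checks on consecutive times suffice,
    because the prefix relation is transitive."""
    return all(d[t + 1][:len(d[t])] == d[t] for t in range(tau, len(d) - 1))


def total_order_holds_at(di_t, dj_t):
    """Messages present in both sequences appear in the same relative order
    (assumes no message is duplicated within a sequence)."""
    common = set(di_t) & set(dj_t)
    return [m for m in di_t if m in common] == [m for m in dj_t if m in common]


def earliest_tau(di, dj):
    """Smallest tau such that both processes are stable and mutually ordered
    from tau until the end of the recorded trace, or None."""
    horizon = min(len(di), len(dj))
    di, dj = di[:horizon], dj[:horizon]
    for tau in range(horizon):
        if (stability_holds_from(di, tau)
                and stability_holds_from(dj, tau)
                and all(total_order_holds_at(di[t], dj[t]) for t in range(tau, horizon))):
            return tau
    return None
```

A trace for which earliest_tau returns 0 is consistent with the strong TOB-Stability and TOB-Total-order properties; a positive value corresponds to an initial period of divergence.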

As we show in this paper, satisfying the following optional (but useful) property in ETOB does not 
require more information about failures. 

TOB-Causal-Order Let pi be a correct process such that two messages m1 and m2 appear in
di(t) at time t ∈ N. If m2 causally depends on m1, then m1 appears before m2 in di(t).

Here we say that a message m2 causally depends on a message m1 in a run R, and write
m1 →R m2, if one of the following conditions holds in R: (1) a process pi sends m1 and then sends
m2, (2) a process pi receives m1 and then sends m2, or (3) there exists m3 such that m1 →R m3
and m3 →R m2.
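
The three conditions can be sketched as a direct relation plus a transitive closure. The encoding below is an assumption made for illustration: a run is flattened into a list of (process, kind, message) events in order of occurrence, with kind being "send" or "recv"; causal_pairs and respects_causal_order are hypothetical names.

```python
def causal_pairs(events):
    """Return the set of pairs (m1, m2) such that m2 causally depends on m1."""
    direct = set()
    seen = {}  # process -> messages it has sent or received so far
    for process, kind, msg in events:
        if kind == "send":
            # Conditions (1) and (2): anything sent or received earlier by
            # this process precedes the message it sends now.
            direct.update((prev, msg) for prev in seen.get(process, []) if prev != msg)
        seen.setdefault(process, []).append(msg)

    # Condition (3): transitive closure of the direct relation.
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def respects_causal_order(delivered, pairs):
    """TOB-Causal-Order on one sequence: m1 precedes m2 whenever both appear."""
    pos = {m: i for i, m in enumerate(delivered)}
    return all(pos[a] < pos[b] for a, b in pairs if a in pos and b in pos)
```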

Eventual Consensus (EC) The consensus abstraction (C) [llj exports, to every process pi, a 
single operation proposeC that takes a binary argument and returns a binary response (we also say 
decides ) so that the following properties are satisfied: 

C-Termination Every correct process eventually returns a response to proposeC. 

C-Integrity Every process returns a response at most once. 
C-Agreement No two processes return different values.

C-Validity Every value returned was previously proposed.

The eventual consensus (EC) abstraction exports, to every process pi, operations proposeEC_1,
proposeEC_2, … that take binary arguments and return binary responses. Assuming that, for all
j ∈ N, every process invokes proposeEC_j as soon as it returns a response to proposeEC_{j−1}, the
abstraction guarantees that, in every admissible run, there exists k ∈ N, such that the following
properties are satisfied:

EC-Termination Every correct process eventually returns a response to proposeEC_j for all j ∈ N.

EC-Integrity No process responds twice to proposeEC_j for all j ∈ N.

EC-Validity Every value returned to proposeEC_j was previously proposed to proposeEC_j for all
j ∈ N.

EC-Agreement No two processes return different values to proposeEC_j for all j ≥ k.
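
The index k in EC-Agreement is not known in advance, but it can be estimated after the fact on a recorded run. The sketch below assumes responses are stored per process as a list indexed by instance number (instance j at position j − 1); agreement_index is a hypothetical helper, and its result on a finite trace is only a lower-bound witness, since later instances are not observed.

```python
def agreement_index(responses):
    """responses maps process -> list of decided values for instances 1..J.

    Returns the smallest k such that all processes returned the same value
    for every recorded instance j >= k, or None if even the last recorded
    instance shows disagreement.
    """
    last = min(len(values) for values in responses.values())
    k = last + 1
    for j in range(last, 0, -1):
        if len({values[j - 1] for values in responses.values()}) == 1:
            k = j
        else:
            break
    return k if k <= last else None
```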

It is straightforward to transform the binary version of EC into a multivalued one with an unbounded
set of inputs [23]. In the following, by referring to EC we mean a multivalued version of it.

Equivalence between EC and ETOB It is well known that, in their classical forms, the consensus 
and the total order broadcast abstractions are equivalent [3]. In this section, we show that a similar 
result holds for our eventual versions of these abstractions. 

The intuition behind the transformation from EC to ETOB is the following. Each time a process
pi wants to ETOB-broadcast a message m, pi sends m to each process. Periodically, every process
pi proposes its current sequence of messages received so far to EC. This sequence is built by
concatenating the last output of EC (stored in a local variable di) to the batch of all messages
received by the process and not yet present in di. The output of EC is stored in di, i.e., at any
time, each process delivers the last sequence of messages returned by EC.

The correctness of this transformation follows from the fact that EC eventually returns consistent
responses to the processes. Thus, eventually, all processes agree on the same linearly growing
sequence of stably delivered messages. Furthermore, every message broadcast by a correct process 
eventually appears either in the delivered message sequence or in the batches of not yet delivered 
messages at all correct processes. Thus, by EC-Validity of EC, every message ETOB-broadcast 
by a correct process is eventually stored in di of every correct process pi forever. By construction,
no message appears in di twice or if it was not previously ETOB-broadcast. Therefore, the
transformation satisfies the properties of ETOB. 
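
The intuition above can be played out in a toy, round-based simulation. This is a sketch under strong assumptions, not the paper's protocol: rounds are synchronous, and eventual consensus is replaced by a hypothetical stub that lets every replica decide its own proposal before round stable_from and the proposal of replica 0 afterwards (both choices respect EC-Validity). The names Replica, new_batch and run_rounds are illustrative.

```python
def new_batch(d, to_deliver):
    """Messages received but not yet in d, in a deterministic (sorted) order."""
    return sorted(m for m in to_deliver if m not in d)


class Replica:
    def __init__(self):
        self.d = []              # delivered sequence d_i
        self.to_deliver = set()  # every message received so far

    def on_push(self, m):
        self.to_deliver.add(m)

    def proposal(self):
        # Last EC output followed by the batch of not yet delivered messages.
        return self.d + new_batch(self.d, self.to_deliver)

    def on_decide(self, sequence):
        self.d = list(sequence)


def run_rounds(replicas, pushes, stable_from):
    """pushes[r] maps a replica index to the messages it receives in round r."""
    for r, received in enumerate(pushes):
        for i, msgs in received.items():
            for m in msgs:
                replicas[i].on_push(m)
        proposals = [rep.proposal() for rep in replicas]
        for i, rep in enumerate(replicas):
            # Stub for eventual consensus: disagreement, then agreement.
            rep.on_decide(proposals[0] if r >= stable_from else proposals[i])
    return [rep.d for rep in replicas]
```

In the example run used for checking, the two replicas first deliver different sequences and then converge on a common, growing sequence once the stub starts returning the same decision to both.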

The transformation from ETOB to EC is as follows. At each invocation of the EC primitive, the 
process broadcasts a message using the ETOB abstraction. This message contains the proposed 
value and the index of the consensus instance. As soon as a message corresponding to a given 
eventual consensus instance is delivered by process pi (appears in di), pi returns the value contained 
in the message. 

Since the ETOB abstraction guarantees that every process eventually stably delivers the same sequence
of messages, there exists a consensus instance after which the responses of the transformation
to all alive processes are identical. Moreover, by TOB-Validity, every message ETOB-broadcast by
a correct process pi is eventually stably delivered. Thus, every correct process eventually returns
from any EC-instance it invokes. Hence, the transformation satisfies the EC specification.
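
The decision rule of this transformation amounts to scanning the delivered sequence for the first message tagged with the instance index. A minimal sketch, assuming messages are encoded as (instance, value) pairs and using None to stand in for "no such message yet"; the function name first is chosen for this example.

```python
def first(d, instance):
    """Value of the first message (instance, v) in the delivered sequence d,
    or None if no message for this instance has been delivered yet."""
    for index, value in d:
        if index == instance:
            return value
    return None
```

While two processes still deliver the messages of an instance in different orders, they may decide differently; once their sequences share the same order, first returns the same value at both.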

Theorem 1. In any environment ℰ, EC and ETOB are equivalent.
Proof. From EC to ETOB To prove this result, it is sufficient to provide a protocol that implements
ETOB in an environment ℰ knowing that there exists a protocol that implements EC in this
environment. This transformation protocol T_{EC→ETOB} is stated in Algorithm 1. We now
prove that T_{EC→ETOB} implements ETOB.

Assume that there exists a message m broadcast by a correct process pi at time t. As pi is correct,
every correct process receives the message push(m) in a finite time. Then, m appears in the set
toDeliver of all correct processes in a finite time. Hence, by the termination property of EC and
the construction of the function NewBatch, there exists ℓ such that m is included in any sequence
submitted to proposeEC_ℓ. By the EC-Validity and the EC-Termination properties, we deduce that
pi stably delivers m in a finite time, which proves that T_{EC→ETOB} satisfies the TOB-Validity property.

If a process pi delivers a message m at time t, then m appears in the sequence returned by its last
invocation of proposeEC_ℓ. By construction and by the EC-Validity property, this sequence contains
only messages that appear in the set toDeliver of a process pj at the time pj invokes proposeEC_ℓ.
But this set is incrementally built at the reception of push messages, which contain only messages
broadcast by a process. This implies that T_{EC→ETOB} satisfies the TOB-No-creation property.

As the sequence output at any time by any process is the response to its last invocation of
proposeEC and the sequence submitted to any invocation of this primitive contains no duplicated
message (by definition of the function NewBatch), we can deduce from the EC-Validity property
that T_{EC→ETOB} satisfies the TOB-No-duplication property.

Assume that a correct process pi stably delivers a message m, i.e., there exists a time after which
m always appears in di. By the algorithm, m always appears in the response of proposeEC to pi
after this time. As the EC-Agreement property is eventually satisfied, we can deduce that m always
appears in the response of proposeEC for any correct process after some time. Thus, any correct
process stably delivers m, and T_{EC→ETOB} satisfies the TOB-Agreement property.

Let τ be the time after which the EC primitive satisfies EC-Agreement and EC-Validity. Let pi be
a correct process and τ ≤ t1 ≤ t2. Let ℓ1 (respectively ℓ2) be the integer such that di(t1)
(respectively di(t2)) is the response of proposeEC_{ℓ1} (respectively proposeEC_{ℓ2}). By construction
of the protocol and the EC-Agreement and EC-Validity properties, we know that, after time τ, the
response of proposeEC_ℓ to correct processes is a prefix of the response of proposeEC_{ℓ+1}. As we
have ℓ1 ≤ ℓ2, we can deduce that T_{EC→ETOB} satisfies the ETOB-Stability property.

Let pi and pj be two correct processes such that two messages m1 and m2 appear in di(t) and dj(t)
at time t ≥ τ. Let ℓ be the smallest integer such that m1 and m2 appear in the response of
proposeEC_ℓ. By the EC-Agreement property, we know that the response of proposeEC_ℓ is identical
for all correct processes. Then, by the ETOB-Stability property proved above, if m1 appears before
m2 in di(t), then m1 appears before m2 in dj(t). In other words, T_{EC→ETOB} satisfies the
ETOB-Total-order property.

In conclusion, T_{EC→ETOB} satisfies the ETOB specification in an environment ℰ provided that
there exists a protocol that implements EC in this environment.

From ETOB to EC To prove this result, it is sufficient to provide a protocol that implements EC in
an environment ℰ given a protocol that implements ETOB in this environment. This transformation
protocol T_{ETOB→EC} is stated in Algorithm 2. We now prove that T_{ETOB→EC} implements EC.

Let pi be a correct process that invokes proposeEC_ℓ(v) with ℓ ∈ N. Then, by fairness and the
TOB-Validity property, the construction of the protocol implies that the ETOB primitive delivers
the message (ℓ, v) to pi in a finite time. By the use of the local timeout, we know that pi returns from
proposeEC_ℓ(v) in a finite time, which proves that T_{ETOB→EC} satisfies the EC-Termination property.

The update of the variable counti to ℓ for any process pi that invokes proposeEC_ℓ and the
assumptions on operations proposeEC ensure that pi executes at most once the function
DecideEC(ℓ, receivedi[Ωi, ℓ]). Hence, T_{ETOB→EC} satisfies the EC-Integrity property.
Algorithm 1 T_{EC→ETOB}: transformation from EC to ETOB for process pi

Output variable:
  di: sequence of messages of M (initially empty) output at any time by pi

Internal variables:
  toDeliveri: set of messages of M (initially empty) containing all messages received by pi
  counti: integer (initially 0) that stores the number of the last instance of consensus invoked by pi

Messages:
  push(m) with m a message of M

Functions:
  Send(message) sends message to all processes (including pi)
  NewBatch(di, toDeliveri) returns a sequence containing all messages from the set
    toDeliveri \ {m | m ∈ di}

On reception of broadcastETOB(m) from the application
  Send(push(m))

On reception of push(m) from pj
  toDeliveri := toDeliveri ∪ {m}

On reception of d as response of proposeEC_ℓ
  di := d
  counti := counti + 1
  proposeEC_counti(di . NewBatch(di, toDeliveri))

On local timeout
  If counti = 0 then
    counti := 1
    proposeEC_1(NewBatch(di, toDeliveri))



Let τ be the time after which the ETOB-Stability and the ETOB-Total-order properties are
satisfied. Let k be the smallest integer such that any process that invokes proposeEC_k in the
considered run invokes it after τ.

If we assume that there exist two correct processes pi and pj that return different values to
proposeEC_ℓ with ℓ ≥ k, we obtain a contradiction with the ETOB-Stability, ETOB-Total-order, or
TOB-Agreement property. Indeed, if pi returns a value after time τ, that implies that this value
appears in di and then, by the TOB-Agreement property, this value eventually appears in dj. If pj
returns a different value from pi, that implies that this value is the first occurrence of a message
associated to proposeEC_ℓ in dj at the time of the return of proposeEC_ℓ. After that, dj cannot satisfy
simultaneously the ETOB-Stability and the ETOB-Total-order properties. This contradiction shows
that T_{ETOB→EC} satisfies the EC-Agreement property.

If we assume that there exists a process pi that returns to proposeEC_ℓ with ℓ ∈ N a value that was
not proposed to proposeEC_ℓ, we obtain a contradiction with the TOB-No-creation property. Indeed,
the return of pi from proposeEC_ℓ is chosen in di, which contains the output of the ETOB primitive,
and processes broadcast only proposed values. This contradiction shows that T_{ETOB→EC} satisfies
the EC-Validity property.

In conclusion, T_{ETOB→EC} satisfies the EC specification in an environment ℰ provided that there
exists a protocol that implements ETOB in this environment. □


Algorithm 2 T_{ETOB→EC}: transformation from ETOB to EC for process pi

Internal variables:
  counti: integer (initially 0) that stores the number of the last instance of consensus invoked by pi
  di: sequence of messages (initially empty) output to pi by the ETOB primitive

Functions:
  First(ℓ): returns the value v such that (ℓ, v) is the first message of the form (ℓ, *) in di if such
    messages exist, ⊥ otherwise
  DecideEC(ℓ, v): returns the value v as response to proposeEC_ℓ

On invocation of proposeEC_ℓ(v)
  counti := ℓ
  broadcastETOB((ℓ, v))

On local timeout
  If First(counti) ≠ ⊥ then
    DecideEC(counti, First(counti))


4 The Weakest Failure Detector for EC 

In this section, we show that Ω is necessary and sufficient for implementing the eventual consensus
abstraction EC:

Theorem 2. In any environment ℰ, Ω is the weakest failure detector for EC.

Ω is necessary for EC Let ℰ be any environment. We show below that Ω is weaker than any
failure detector 𝒟 that can be used to solve EC in ℰ. Recall that implementing Ω means outputting,
at every process, the identifier of a leader process so that eventually, the same correct leader is 
output permanently at all correct processes. 
First, we briefly recall the arguments used by Chandra et al. [2] in the original CHT proof deriving
Ω from any algorithm solving consensus (for a more detailed survey of the proof, please refer to
Appendix B or m Chapter 3]). The basic observation there is that a run of any algorithm using
a failure detector induces a directed acyclic graph (DAG). The DAG contains a sample of failure
detector values output by 𝒟 in the current run and captures causal relations between them. Each
process pi maintains a local copy of the DAG, denoted by Gi. pi periodically queries its failure
detector module, updates Gi by connecting every vertex of the DAG with the vertex containing
the returned failure-detector value with an edge, and broadcasts the DAG. An edge from vertex
[pi, d, m] to vertex [pj, d', m'] is thus interpreted as “pi queried 𝒟 for the mth time and obtained
value d and after that pj queried 𝒟 for the m'th time and obtained value d'”. Whenever pi receives
a DAG Gj calculated earlier by pj, pi merges Gi with Gj. As a result, DAGs maintained by the
correct processes converge to the same infinite DAG G. The DAG Gi is then used by pi to simulate
a number of runs of the given consensus algorithm A for all possible inputs to the processes. All
these runs are organized in the form of a simulation tree Υi. The simulation trees Υi maintained
by the correct processes converge to the same infinite simulation tree Υ.

The outputs produced in the simulated runs of Υi are then used by pi to compute the current
estimate of Ω. Every vertex σ of Υi is assigned a valency tag based on the decisions taken in all its
extensions (descendants of σ in Υi): σ is assigned a tag v ∈ {0,1} if σ has an extension in which
some process decides v. A vertex is bivalent if it is assigned both 0 and 1. It is then shown in [2] that
by locating the same bivalent vertex in the limit tree Υ, the correct processes can eventually extract
the identifier of the same correct process. (More details can be found in Appendix B and [2, 22].)

We show that this method, originally designed for consensus, can be extended to eventual 
consensus (i.e., to the weaker EC abstraction). The extension is not trivial and requires carefully 
adjusting the notion of valency of a vertex in the simulation tree. 

Lemma 1. In every environment ℰ, if a failure detector 𝒟 implements EC in ℰ, then Ω is weaker
than 𝒟 in ℰ.

Proof. Let A be any algorithm that implements EC using a failure detector 𝒟 in an environment ℰ.
As in [2], every process p_i maintains a failure detector sample stored in DAG G_i and periodically
uses G_i to simulate a set of runs of A for all possible sequences of inputs of EC. The simulated runs
are organized by p_i in an ever-growing simulation tree Υ_i. A vertex of Υ_i is the schedule of a finite
run of A “triggered” by a path in G_i in which every process starts with invoking proposeEC_1(v),
for some v ∈ {0,1}, takes steps using the failure detector values stipulated by the path in G_i and,
once proposeEC_ℓ(v) is complete, eventually invokes proposeEC_{ℓ+1}(v'), for some v' ∈ {0,1}. (For the
record, we equip each vertex of Υ_i with the path in G_i used to produce it.) A vertex is connected
by an edge to each one-step extension of it.¹

Note that in every admissible infinite simulated run, EC-Termination, EC-Integrity and EC-Validity
are satisfied and that there is k > 0 such that for all ℓ > k, the invocations and responses
of proposeEC_ℓ satisfy EC-Agreement.

Since processes periodically broadcast their DAGs, the simulation tree Υ_i constructed locally by
a correct process p_i converges to an infinite simulation tree Υ, in the sense that every finite subtree
of Υ is eventually part of Υ_i. The infinite simulation tree Υ starts from the initial configuration
of A and, in the limit, contains all possible schedules that can be triggered by the paths in the DAGs G_i.

¹ In [2], the simulated schedules form a simulation forest, where a distinct simulation tree corresponds to each
initial configuration encoding consensus inputs. Here we follow [?]: there is a single initial configuration and inputs
are encoded in the form of input histories. As a result, we get a single simulation tree where branches depend on the
parameters of proposeEC_ℓ calls.

Algorithm 3 Locating a bivalent vertex in Υ.
    k := 1
    σ := root of Υ
    while true do
        if σ is k-bivalent then break
        σ1 := a descendant of σ in which
              EC-Agreement does not hold for proposeEC_k
        σ2 := a descendant of σ1 in which every correct process
              completes proposeEC_k and receives
              all messages sent to it in σ
        choose k' > k and σ3, a descendant of σ2, such that
              the k'-tag of σ3 contains {0,1}
        k := k'
        σ := σ3


Consider a vertex σ in Υ identifying a unique finite schedule of a run of A using 𝒟 in the current
failure pattern F. For k > 0, we say that σ is k-enabled if k = 1 or σ contains a response from
proposeEC_{k-1} at some process. Now we associate each vertex σ in Υ with a set of valency tags
associated with each “consensus instance” k, called the k-tag of σ, as follows:

• If σ is k-enabled and has a descendant (in Υ) in which proposeEC_k returns x ∈ {0,1}, then
x is added to the k-tag of σ.

• If σ is k-enabled and has a descendant in which two different values are returned by
proposeEC_k, then ⊥ is added to the k-tag of σ.

If σ is not k-enabled, then its k-tag is empty. If the k-tag of σ is {x}, x ∈ {0,1}, we say that
σ is (k, x)-valent (k-univalent). If the k-tag is {0,1}, then we say that σ is k-bivalent. If the k-tag
of σ contains ⊥, we say that σ is k-invalid.
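
The tagging rules above can be illustrated on a small finite tree. The following Python sketch is only an illustration under simplifying assumptions: the names `Vertex`, `k_tag` and `classify` are hypothetical, each vertex directly stores the values returned by each instance in its schedule, and a vertex is counted among its own descendants. The actual simulation tree is infinite and built from DAG paths, which this sketch does not model.

```python
from dataclasses import dataclass, field

BOTTOM = "⊥"  # marks a descendant in which one instance returned two values


@dataclass
class Vertex:
    # responses[k] = set of values returned by proposeEC_k in this schedule
    responses: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def descendants(vertex):
    """Yield the vertex itself and every vertex below it (an assumption)."""
    yield vertex
    for child in vertex.children:
        yield from descendants(child)


def is_enabled(vertex, k):
    """k-enabled: k = 1, or the schedule contains a response of instance k-1."""
    return k == 1 or bool(vertex.responses.get(k - 1))


def k_tag(vertex, k):
    """Collect the k-tag of a vertex over the finite subtree below it."""
    if not is_enabled(vertex, k):
        return set()
    tag = set()
    for d in descendants(vertex):
        returned = d.responses.get(k, set())
        tag |= returned
        if len(returned) > 1:  # two different values returned by instance k
            tag.add(BOTTOM)
    return tag


def classify(vertex, k):
    tag = k_tag(vertex, k)
    if BOTTOM in tag:
        return "invalid"
    if tag == {0, 1}:
        return "bivalent"
    if len(tag) == 1:
        return "univalent"
    return "untagged"


left = Vertex({1: {0}})
right = Vertex({1: {1}})
root = Vertex(children=[left, right])
conflicted = Vertex({1: {0, 1}})
bad_root = Vertex(children=[Vertex({1: {1}}, [conflicted])])
```

Here `root` is 1-bivalent, `left` is (1, 0)-valent, and `bad_root` is 1-invalid because one of its descendants contains a disagreement in instance 1.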

Since A ensures EC-Termination in all admissible runs extending σ, for each k-enabled vertex σ,
the k-tag of σ is non-empty. Moreover, EC-Termination and EC-Validity imply that a vertex in
which no process has invoked proposeEC_k yet has a descendant in which proposeEC_k returns 0 and
a descendant in which proposeEC_k returns 1. Indeed, a run in which only v, v ∈ {0,1}, is proposed
in instance k and every correct process takes enough steps must contain v as an output. Thus:

(*) For each vertex σ, there exist k ∈ N and σ', a descendant of σ, such that the k-tag of σ' contains
{0,1}.

If the “limit tree” Υ contains a k-bivalent vertex, we can apply the arguments of [2] to extract
Ω. Now we show that such a vertex exists in Υ. Then we can simply let every process locate the
“first” such vertex in its local tree Υ_i. To establish an order on the vertices, we can associate each
vertex σ of Υ with the value m such that vertex [p_j, d, m] of G is used to simulate the last step of
σ (recall that we equip each vertex of Υ with the corresponding path). Then we order vertices of
Υ in the order consistent with the growth of m. Since every vertex in G has only finitely many
incoming edges, the sets of vertices having the same value of m are finite. Thus, we can break the
ties in the m-based order using any deterministic procedure on these finite sets.

By choosing the first k-bivalent vertex in their local trees Υ_i, the correct processes
will eventually stabilize on the same k-bivalent vertex σ in the limit tree Υ and apply the CHT
extraction procedure to derive the same correct process based on the k-tags assigned to σ's descendants.

It remains to show that Υ indeed contains a k-bivalent vertex for some k. Consider the procedure
described in Algorithm 3 that intends to locate such a vertex, starting with the root of the tree.

For the currently considered k-enabled vertex σ that is not k-bivalent (if it is k-bivalent, we are
done), we use (*) to locate σ3, a descendant of σ, such that (1) in σ3, two processes return different
values in proposeEC_k, (2) in σ3, every correct process has completed proposeEC_k and has
received every message sent to it in σ, and (3) the k'-tag of σ3 contains {0,1} for some k' > k.

Thus, the procedure in Algorithm 3 either terminates by locating a k-bivalent vertex, and then we
are done, or it never terminates. Suppose, by contradiction, that the procedure never terminates.
Hence, we have an infinite admissible run of A in which no agreement is provided in infinitely
many instances of consensus. Indeed, in the constructed path along the tree, every correct process
appears infinitely many times and receives every message sent to it. This admissible run violates
the EC-Agreement property of EC, a contradiction.

Thus, the correct processes will eventually locate the same k-bivalent vertex and then, as in [2],
stabilize extracting the same correct process identifier to emulate Ω. □

Ω is sufficient for EC. Chandra and Toueg proved that Ω is sufficient to implement the classical
version of the consensus abstraction in an environment where a majority of processes are correct
[3]. In this section, we extend this result to the eventual consensus abstraction for any environment.

The proposed implementation of EC is very simple. Each process has access to an Ω failure
detector module. Upon each invocation of the EC primitive, a process broadcasts the proposed value
(and the associated consensus index). Every process stores every received value. Each process p_i
periodically checks whether it has received a value for the current consensus instance from the
process that it currently believes to be the leader. If so, p_i returns this value. The correctness of
this EC implementation relies on the fact that, eventually, all correct processes trust the same leader
(by the definition of Ω) and then decide (return responses) consistently on the values proposed by
this process.

Lemma 2. In every environment ℰ, EC can be implemented using Ω.

Proof. We propose such an implementation in Algorithm 4. Then, we prove that any admissible run
r of the algorithm in any environment ℰ satisfies the EC-Termination, EC-Integrity, EC-Agreement,
and EC-Validity properties.

Assume that a correct process never returns from an invocation of proposeEC in r. Without
loss of generality, denote by ℓ the smallest integer such that a correct process p_i never returns
from the invocation of proposeEC_ℓ. This implies that p_i always evaluates received_i[Ω_i, count_i] to
⊥. We know by definition of Ω that, eventually, Ω_i always returns the same correct process p_j.
Hence, by construction of ℓ, p_j returns from proposeEC_0, ..., proposeEC_{ℓ-1} and then sends the
message promote(v, ℓ) to all processes in a finite time. As p_i and p_j are correct, p_i receives this
message and updates received_i[Ω_i, count_i] to v in a finite time, which contradicts the assumption.
Therefore, the algorithm satisfies the EC-Termination property.

The update of the variable count_i to ℓ for any process p_i that invokes proposeEC_ℓ and the
assumptions on operations proposeEC ensure that p_i executes the function
DecideEC(ℓ, received_i[Ω_i, ℓ]) at most once. Hence, the EC-Integrity property is satisfied.

Let τ_Ω be the time from which the local outputs of Ω are identical and constant for all correct
processes in r. Let k be the smallest integer such that any process that invokes proposeEC_k in r
invokes it after τ_Ω.

Let ℓ be an integer such that ℓ > k. Assume that p_i and p_j are two processes that respond
to proposeEC_ℓ. Then, they respectively execute the functions DecideEC(ℓ, received_i[Ω_i, ℓ]) and
DecideEC(ℓ, received_j[Ω_j, ℓ]). By construction of k, we can deduce that Ω_i = Ω_j = p_l for some
process p_l. This implies that p_i and p_j both received a message promote(v, ℓ) from p_l. As p_l sends
such a message at most once, we can deduce that received_i[p_l, ℓ] = received_j[p_l, ℓ], which proves
that the algorithm ensures the EC-Agreement property.

Let ℓ be an integer such that ℓ > k. Assume that p_i is a process that responds to proposeEC_ℓ.
The value returned by p_i was previously received from Ω_i in a message of type promote. By
construction of the protocol, Ω_i sends only one message of this type and the latter contains the
value proposed to Ω_i; hence, the EC-Validity property is satisfied.

Thus, Algorithm 4 indeed implements EC in any environment using Ω. □


Algorithm 4 EC using Ω: algorithm for process p_i
Local variables:
    count_i: integer (initially 0) that stores the number of the last instance of consensus invoked by p_i
    received_i: two-dimensional table that stores a value for each process/integer pair (initially ⊥)

Functions:
    DecideEC(ℓ, v) returns the value v as a response to proposeEC_ℓ

Messages:
    promote(v, ℓ) with v ∈ {0,1} and ℓ ∈ N

On invocation of proposeEC_ℓ(v)
    count_i := ℓ
    Send promote(v, ℓ) to all
On reception of promote(v, ℓ) from p_j
    received_i[j, ℓ] := v
On local time out
    If received_i[Ω_i, count_i] ≠ ⊥ do
        DecideEC(count_i, received_i[Ω_i, count_i])
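
To make the behaviour of Algorithm 4 concrete, here is a minimal Python sketch of a three-process run. It is a toy under stated assumptions, not the authors' implementation: message delivery is synchronous and reliable, the Ω module is modelled by a hypothetical `trusted` dictionary that the test driver changes by hand, and the class name `ECProcess` and the `decided` guard (which stands in for the assumption that the next instance is invoked as soon as a response is returned) are illustrative.

```python
class ECProcess:
    """One process of the Ω-based EC sketch (names are illustrative)."""

    def __init__(self, pid, network, omega):
        self.pid = pid
        self.network = network   # shared list of all processes (assumption)
        self.omega = omega       # callable: pid -> currently trusted leader
        self.count = 0           # count_i
        self.received = {}       # received_i[(j, l)] -> value, absent means ⊥
        self.decided = {}        # l -> value returned for proposeEC_l

    def propose(self, l, v):
        self.count = l
        for proc in self.network:  # "Send promote(v, l) to all"
            proc.on_promote(self.pid, v, l)

    def on_promote(self, sender, v, l):
        self.received[(sender, l)] = v

    def on_timeout(self):
        key = (self.omega(self.pid), self.count)
        # Guard so that one instance yields a single response in this sketch.
        if key in self.received and self.count not in self.decided:
            self.decided[self.count] = self.received[key]


trusted = {0: 0, 1: 1, 2: 2}  # Ω not yet stable: everyone trusts itself
network = []
for i in range(3):
    network.append(ECProcess(i, network, lambda pid: trusted[pid]))

for proc, v in zip(network, [0, 1, 1]):
    proc.propose(1, v)
for proc in network:
    proc.on_timeout()
instance1 = [proc.decided[1] for proc in network]

trusted.update({1: 0, 2: 0})  # Ω stabilizes on process 0
for proc, v in zip(network, [1, 0, 0]):
    proc.propose(2, v)
for proc in network:
    proc.on_timeout()
instance2 = [proc.decided[2] for proc in network]
```

In this run, instance 1 is decided while the leaders differ, so the responses `[0, 1, 1]` disagree, which EC tolerates for finitely many instances. Once every process trusts process 0, instance 2 returns the leader's proposal `1` everywhere.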


5 An Eventual Total Order Broadcast Algorithm 

We have shown in the previous section that Ω is the weakest failure detector for the EC abstraction
(and, by Theorem 1, the ETOB abstraction) in any environment. In this section, we describe an
algorithm that directly implements ETOB using Ω and which we believe is interesting in its own
right.

The algorithm has three interesting properties. First, it needs only two communication steps
to deliver any message when the leader does not change, whereas algorithms implementing classical
TOB need at least three communication steps in this case. Second, the algorithm actually
implements total order broadcast if Ω outputs the same leader at all processes from the very
beginning. Third, the algorithm additionally ensures the property of TOB-Causal-Order, which does
not require more information about faults.

The intuition behind this algorithm is as follows. Every process that intends to ETOB-broadcast
a message sends it to all other processes. Each process p_i has access to an Ω failure detector module
and maintains a DAG that stores the set of messages delivered so far together with their causal
dependencies. As long as p_i considers itself the leader (its module of Ω outputs p_i), it periodically
sends to all processes a sequence of messages computed from its DAG so that the sequence respects
the causal order and admits the last delivered sequence as a prefix. A process that receives a
sequence of messages delivers it only if it has been sent by the current leader output by Ω. The
correctness of this algorithm directly follows from the properties of Ω. Indeed, once all correct
processes trust the same leader, this leader promotes its own sequence of messages, which ensures
the ETOB specification.

Algorithm 5 ℰTOB: protocol for process p_i
Output variable:
    d_i: sequence of messages m ∈ M (initially empty) output by p_i

Internal variables:
    promote_i: sequence of messages m ∈ M (initially empty) promoted by p_i when Ω_i = p_i
    CG_i: directed graph on messages of M (initially empty) that contains causality dependencies known by p_i

Messages:
    update(CG_i) with CG_i a directed graph on messages of M
    promote(promote_i) with promote_i a sequence of messages m ∈ M

Functions:
    UpdateCG(m, C(m)) adds the node m and the set of edges {(m', m) | m' ∈ C(m)} to CG_i
    UnionCG(CG_j) replaces CG_i by the union of CG_i and CG_j
    UpdatePromote() replaces promote_i by one of the sequences of messages s such that promote_i is a prefix of
        s, s contains exactly once every message of CG_i, and for every edge (m1, m2) of CG_i, m1 appears before m2 in s

On broadcastETOB(m, C(m)) from the application
    UpdateCG(m, C(m))
    Send update(CG_i) to all
On reception of update(CG_j) from p_j
    UnionCG(CG_j)
    UpdatePromote()
On reception of promote(promote_j) from p_j
    If Ω_i = p_j then
        d_i := promote_j
On local time out
    If Ω_i = p_i then
        Send promote(promote_i) to all
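
The UpdatePromote function of Algorithm 5 only constrains its result: the new sequence must extend the old one, contain every known message exactly once, and respect the causality edges. One possible way to compute such a sequence is sketched below in Python. The function name `update_promote`, the dictionary encoding of the causality graph, and the choice of the smallest ready message as a deterministic tie-break are all assumptions of this sketch. It also assumes that the existing prefix is already compatible with the graph, which it does not check.

```python
def update_promote(promote, causality):
    """Extend `promote` into a sequence covering the whole causality graph.

    promote:   list of messages already promoted (kept as a prefix)
    causality: dict mapping a message m to the set C(m) of messages
               that must appear before m
    """
    sequence = list(promote)
    placed = set(sequence)
    # Every message that appears as a node or as a dependency.
    nodes = set(causality).union(*causality.values())
    pending = sorted(nodes - placed)
    while pending:
        # A message is ready once all of its dependencies are placed.
        ready = [m for m in pending if causality.get(m, set()) <= placed]
        if not ready:
            raise ValueError("causality graph is cyclic")
        nxt = ready[0]
        sequence.append(nxt)
        placed.add(nxt)
        pending.remove(nxt)
    return sequence


s1 = update_promote([], {"a": {"z"}, "z": set()})
s2 = update_promote(s1, {"a": {"z"}, "z": set(), "b": {"a"}, "c": set()})
```

Because each call appends to the previous sequence, successive values of the promoted sequence are prefixes of one another, which is the property used in the stability and total-order arguments.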

The pseudocode of the algorithm is given in Algorithm 5. Below we present the proof of its
correctness, including the proof that the algorithm additionally ensures TOB-Causal-Order.

Lemma 3. In every environment ℰ, Algorithm ℰTOB implements ETOB using Ω.

Proof. First, we prove that any run r of ℰTOB in any environment ℰ satisfies the TOB-Validity,
TOB-No-creation, TOB-No-duplication, and TOB-Agreement properties.

Assume that a correct process p_i broadcasts a message m at time t for a given t ∈ N. We
know that Ω outputs the same correct process p_j to all correct processes in a finite time. As p_j is
correct, it receives the message update(CG_i) from p_i (that contains m) in a finite time. Then, p_j
includes m in its causality graph (by a call to UnionCG) and in its promotion sequence (by a call
to UpdatePromote). As p_j never removes a message from its promotion sequence and is output
by Ω, p_i adopts the promotion sequence of p_j in a finite time and this sequence contains m, which
proves that ℰTOB satisfies the TOB-Validity property.

Any sequence output by any process is built by a call to UpdatePromote by a process p_i.
This function ensures that any message appearing in the computed sequence appears in the graph
CG_i. This graph is built by successive calls to UnionCG that ensure that the graph contains only
messages received in a message of type update. The construction of the protocol ensures that
such messages have been broadcast by a process. Then, we can deduce that ℰTOB satisfies the
TOB-No-creation property.

Any sequence output by any process is built by a call to UpdatePromote that ensures that
any message appears only once. Then, we can deduce that ℰTOB satisfies the TOB-No-duplication
property.

Assume that a correct process p_i stably delivers a message m at time t for a given t ∈ N. We
know that Ω outputs the same correct process p_j to all correct processes after some finite time. Since
m appears in every d_i(t') such that t' > t, we derive that m appears infinitely often in promote_j from a
given point of the run. Hence, the construction of the protocol and the correctness of p_j imply
that any correct process eventually stably delivers m, and ℰTOB satisfies the TOB-Agreement
property.

We now prove that, for any environment ℰ and any run r of ℰTOB in ℰ, there exists a τ ∈ N
satisfying the ETOB-Stability, ETOB-Total-order, and TOB-Causal-Order properties in r. Hence, let r
be a run of ℰTOB in an environment ℰ. Let us define:

• τ_Ω, the time from which the local outputs of Ω are identical and constant for all correct
processes in r;

• Δ_c, the longest communication delay between two correct processes in r;

• Δ_t, the longest local timeout for correct processes in r;

• τ = τ_Ω + Δ_t + Δ_c.

Let p_i be a correct process and p_j be the correct process elected by Ω after τ_Ω. Let t_1 and t_2 be
two integers such that τ < t_1 < t_2. As the output of Ω is stable after τ_Ω and the choice of τ
ensures that p_i receives at least one message of type promote from p_j, we can deduce from the
construction of the protocol that there exist t_3 < t_1 and t_4 < t_2 such that d_i(t_1) = promote_j(t_3)
and d_i(t_2) = promote_j(t_4). But the function UpdatePromote used to build promote_j ensures that
promote_j(t_3) is a prefix of promote_j(t_4). Then, ℰTOB satisfies the ETOB-Stability property after
time τ.

Let p_i and p_j be two correct processes such that two messages m_1 and m_2 appear in d_i(t) and
d_j(t) at time t > τ. Assume that m_1 appears before m_2 in d_i(t). Let p_k be the correct process
elected by Ω after τ_Ω. As the output of Ω is stable after τ_Ω and the choice of τ ensures that p_i and
p_j receive at least one message of type promote from p_k, the construction of the protocol ensures that we
can consider t_1 and t_2 such that d_i(t) = promote_k(t_1) and d_j(t) = promote_k(t_2). The definition
of the function UpdatePromote executed by p_k allows us to deduce that either d_i(t) is a prefix of
d_j(t) or d_j(t) is a prefix of d_i(t). In both cases, we obtain that m_1 appears before m_2 in d_j(t), which
proves that ℰTOB satisfies the ETOB-Total-order property after time τ.

Let p_i be a correct process such that two messages m_1 and m_2 appear in d_i(t) at time t > 0.
Assume that m_1 ∈ C(m_2) when m_2 is broadcast. Let p_j be the process trusted by Ω_i at the
time p_i adopts the sequence d_i(t). If m_2 appears in d_i(t), that implies that the edge (m_1, m_2)
appears in CG_j at the time p_j executes UpdatePromote (since p_j previously executed UnionCG,
which includes at least m_2 and the set of edges {(m', m_2) | m' ∈ C(m_2)} in CG_j). The construction of
UpdatePromote ensures that m_1 appears before m_2 in d_i(t), which proves that ℰTOB satisfies
the TOB-Causal-Order property.

In conclusion, ℰTOB is an implementation of ETOB assuming that processes have access to the
Ω failure detector in any environment. □

6 Related Work

Modern data service providers such as Amazon’s Dynamo [7], Yahoo’s PNUTs [6] or Google Bigtable
distributed storage [1] are intended to offer highly available services. They consequently replicate
those services over several server processes. In order to tolerate process failures as well as partitions,
they consider eventual consistency [?].

The term eventual consensus was introduced in [18]. It refers to one instance of consensus which
stabilizes at the end, not multiple instances as we consider in this paper. In [9], a self-stabilizing
form of consensus was proposed: assuming a self-stabilizing implementation of ◇S (also described in
the paper) and executing a sequence of consensus instances, validity and agreement are eventually
ensured. Their consensus abstraction is close to ours but the authors focused on the shared-memory
model and did not address the question of the weakest failure detector.

In [10], the intuition behind eventual consistency was captured through the concept of eventual
serializability. Two kinds of operations were defined: (1) a “stable” operation whose response
needs to be totally ordered after all operations preceding it and (2) “weak” operations whose
responses might not reflect all their preceding operations. Our ETOB abstraction captures
consistency with respect to the “weak” operations. (Our lower bound on the necessity of Ω naturally
extends to the stronger definitions.)

Our perspective on eventual consistency is closely related to the notion of eventual linearizability
discussed recently in [26] and [15]. It is shown in [26] that the weakest failure detector to boost
eventually linearizable objects to linearizable ones is ◇P. We are focusing primarily on the weakest
failure detector to implement eventual consistency, so their result is orthogonal to ours.

In [15], eventual linearizability is compared against linearizability in the context of implementing
specific objects in a shared-memory context. It turns out that an eventually linearizable
implementation of a fetch-and-increment object is as hard to achieve as a linearizable one. Our ETOB
construction can be seen as an eventually linearizable universal construction: given any sequential
object type, ETOB provides an eventually linearizable concurrent implementation of it. Brought
to the message-passing environment with a correct majority, our results complement [?]: we show
that in this setting, an eventually consistent replicated service (eventually linearizable object with
a sequential specification) requires exactly the same information about failures as a consistent
(linearizable) one.

7 Concluding Remarks 

This paper defined the abstraction of eventual total order broadcast and proved its equivalence to
eventual consensus: two fundamental building blocks to implement a general replicated state
machine that ensures eventual consistency. We proved that the weakest failure detector to implement
these abstractions is Ω, in any message-passing environment. We could hence determine the gap
between building a general replicated state machine that ensures consistency in a message-passing
system and one that ensures only eventual consistency. In terms of information about failures, this
gap is precisely captured by failure detector Σ [8]. In terms of time complexity, the gap is exactly
one message delay: an operation on the strongly consistent replicated service must, in the worst case,
incur three communication steps [22], while one built using our eventual total order broadcast protocol
completes an operation in the optimal number of two communication steps.

Our ETOB abstraction captures a form of eventual consistency implemented in multiple
replicated services [?]. In addition to eventual consistency guarantees, such systems sometimes
produce indications when a prefix of operations on the replicated service is committed, i.e., is not
subject to further changes. A prefix of operations can be committed, e.g., in sufficiently long
periods of synchrony, when a majority of correct processes elect the same leader and all incoming
and outgoing messages of the leader to the correct majority are delivered within some fixed bound.
We believe that such indications could easily be implemented, during the stable periods, on top of
ETOB. Naturally, our results imply that Ω is necessary for such systems too.

Our EC abstraction assumes eventual agreement, but requires integrity and validity to be always
ensured. Other definitions of eventual consensus could be considered. In particular, we have studied
an eventual consensus abstraction assuming, instead of eventual agreement, eventual integrity, i.e.,
a bounded number of decisions in a given consensus instance could be revoked a finite number
of times. In Appendix A, we define this abstraction of eventual irrevocable consensus (EIC) more
precisely and show that it is equivalent to our EC abstraction.

A Discussion on Eventual Consensus

Our definition of Eventual Consensus (EC) relaxes the Agreement property, which holds only after a
finite number of operations. We could instead relax the Integrity property, allowing processes to change
their decisions a finite number of times. We discuss here the resulting abstraction.

A.1 Eventual Irrevocable Consensus (EIC)

The eventual irrevocable consensus (EIC) abstraction exports, to every process p_i, a sequence of
operations proposeEIC_ℓ, ℓ ∈ N, that take binary arguments and return binary responses. If a
process p_i responds more than once to proposeEIC_ℓ for some ℓ ∈ N, we consider that the response of
p_i to proposeEIC_ℓ at time t ∈ N is its last response to proposeEIC_ℓ before t.

Assuming that every process receives an invocation of proposeEIC_{ℓ+1} as soon as it returns a (first)
response to proposeEIC_ℓ, for all ℓ ∈ N, the abstraction guarantees that, for every run, there exists
k ∈ N such that the following properties are satisfied:

EIC-Termination Every correct process eventually returns a response to proposeEIC_ℓ for all ℓ ∈ N.

EIC-Integrity No process responds twice to proposeEIC_ℓ for all ℓ > k.

EIC-Agreement No two processes forever return different values to proposeEIC_ℓ for any ℓ ∈ N.

EIC-Validity Every value returned to proposeEIC_j was previously proposed to proposeEIC_j for all
j ∈ N.

Theorem 3. In every environment ℰ, EC and EIC are equivalent.

A.2 Transformation from EC to EIC 

Lemma 4. In every environment ℰ, there exists a transformation from EC to EIC.

Proof. To prove this result, it is sufficient to provide a protocol that implements EIC in an
environment ℰ knowing that there exists a protocol that implements EC in this environment. This
transformation protocol T_{EC→EIC} is stated in Algorithm 6. We now prove that T_{EC→EIC}
implements EIC.

As any invocation of proposeEIC_ℓ by a correct process p_i leads to an invocation of proposeEC_ℓ
by the same process, the EC-Termination property ensures that p_i eventually receives a response
(a sequence decision) from the EC primitive. Before this response, we have decision_i[ℓ] = ⊥. By
the EC-Validity property, we know that decision[ℓ] is a value proposed by one process (hence not
equal to ⊥). Then, the construction of the protocol ensures that DecideEIC(ℓ, decision[ℓ]) is
executed in a finite time, which proves that T_{EC→EIC} satisfies the EIC-Termination property.

Let k be the index after which the EC primitive satisfies the EC-Agreement property. Let τ be the
smallest time at which all correct processes have received the response of proposeEC_k.

After time τ, we know that the sequences decision returned to all processes are identical. Then,
the construction of the protocol ensures that every sequence submitted to the EC primitive is
prefixed by the last sequence returned by this primitive. Hence, the EC-Agreement property ensures
that, after time τ, DecideEIC is executed only for the last value of the decision sequence and
only when this sequence grows, which proves that T_{EC→EIC} satisfies the EIC-Integrity property.

Assume that two processes p_i and p_j return forever two different values for proposeEIC_ℓ for
some ℓ. By the EIC-Integrity property proved above, we know that p_i and p_j take at most one
decision for proposeEIC_ℓ after time τ. That implies that p_i and p_j return different values at their
last decision. Then, we can deduce that decision_i[ℓ] ≠ decision_j[ℓ] forever, which contradicts
the definition of τ or the EC-Agreement property. This contradiction shows that
T_{EC→EIC} satisfies the EIC-Agreement property.

The fact that T_{EC→EIC} satisfies EIC-Validity directly follows from EC-Validity.

In conclusion, T_{EC→EIC} satisfies the EIC specification in an environment ℰ provided that there
exists a protocol that implements EC in this environment. □


Algorithm 6 T_{EC→EIC}: transformation from EC to EIC for process p_i
Internal variables:
    decision_i: sequence of values decided by p_i (initially ε)

Functions:
    DecideEIC(ℓ, v) returns the value v as a response to proposeEIC_ℓ

On invocation of proposeEIC_ℓ(v)
    proposeEC_ℓ(decision_i.v)
On reception of decision as response of proposeEC_ℓ
    For k from 0 to ℓ do
        If decision[k] ≠ decision_i[k] then
            DecideEIC(k, decision[k])
    decision_i := decision
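
The local logic of Algorithm 6 can be sketched in Python as follows. This is a minimal illustration, assuming the EC primitive is external and simply hands back a sequence; the class name `EcToEic`, the method names, and the `responses` log standing in for calls to DecideEIC are hypothetical. Instances are indexed from 0 to match the loop in the listing, and `None` plays the role of ⊥.

```python
class EcToEic:
    """Local state of one process in the EC-to-EIC sketch."""

    def __init__(self):
        self.decision = []   # decision_i, initially the empty sequence
        self.responses = []  # (k, value) pairs handed to DecideEIC

    def value_for_ec(self, v):
        """Sequence submitted to proposeEC_l on proposeEIC_l(v): decision_i.v"""
        return self.decision + [v]

    def on_ec_response(self, decision):
        """Handle the sequence returned by proposeEC_l."""
        for k, value in enumerate(decision):
            local = self.decision[k] if k < len(self.decision) else None
            if value != local:
                # New or revoked decision for instance k.
                self.responses.append((k, value))
        self.decision = list(decision)


proc = EcToEic()
first = proc.value_for_ec(1)
proc.on_ec_response([0])         # EC returns another process's proposal
second = proc.value_for_ec(1)
proc.on_ec_response([1, 1])      # before stabilization: prefix changes
third = proc.value_for_ec(0)
proc.on_ec_response([1, 1, 0])   # after stabilization: sequence only grows
```

In this run the decision for instance 0 is revoked once (from 0 to 1), after which each EC response triggers a single new decision, matching the eventual integrity that EIC requires.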


A.3 Transformation from EIC to EC 

Lemma 5. In every environment ℰ, there exists a transformation from EIC to EC.

Proof. To prove this result, it is sufficient to provide a protocol that implements EC in an
environment ℰ knowing that there exists a protocol that implements EIC in this environment. This
transformation protocol T_{EIC→EC} is stated in Algorithm 7. We now prove that T_{EIC→EC}
implements EC.

As any invocation of proposeEC_ℓ by a correct process p_i leads to an invocation of proposeEIC_ℓ
by the same process, the EIC-Termination property ensures that p_i eventually receives at least
one response from the EIC primitive. The use of the counter count_i allows us to deduce that only
the first response from the EIC primitive leads to a decision for proposeEC_ℓ by p_i, which proves that
T_{EIC→EC} satisfies the EC-Termination and the EC-Integrity properties.

The construction of the protocol and the EIC-Agreement and the EIC-Validity properties trivially
imply that T_{EIC→EC} satisfies the EC-Agreement and the EC-Validity properties.

In conclusion, T_{EIC→EC} satisfies the EC specification in an environment ℰ provided that there
exists a protocol that implements EIC in this environment. □
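
The filtering role of the counter in this transformation can be illustrated with a few lines of Python. This is a sketch under assumptions: the class `EicToEc` and its method names are hypothetical, the EIC primitive is external, and, as in the text, the next instance is assumed to be invoked as soon as the first response to the current one is returned.

```python
class EicToEc:
    """Local state of one process in the EIC-to-EC sketch."""

    def __init__(self):
        self.count = 0       # count_i: last instance invoked by this process
        self.decisions = []  # (l, v) pairs handed to DecideEC

    def propose_ec(self, l, v):
        """On proposeEC_l(v): remember l and forward to proposeEIC_l(v)."""
        self.count = l
        return ("proposeEIC", l, v)

    def on_eic_response(self, l, v):
        """Forward an EIC response only if it concerns the current instance."""
        if self.count == l:
            self.decisions.append((l, v))


proc = EicToEc()
proc.propose_ec(1, 0)
proc.on_eic_response(1, 0)   # first response to instance 1: decided
proc.propose_ec(2, 1)        # next instance invoked right away
proc.on_eic_response(1, 1)   # late revocation for instance 1: ignored
proc.on_eic_response(2, 1)
```

The revised response for instance 1 arrives after instance 2 has started, so it is dropped and each EC instance yields exactly one decision.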


B Background on the CHT proof

Let ℰ be any environment, 𝒟 be any failure detector that can be used to solve consensus in ℰ, and
A be any algorithm that solves consensus in ℰ using 𝒟. We determine a reduction algorithm T_{𝒟→Ω}
that, using failure detector 𝒟 and algorithm A, implements Ω in ℰ. Recall that implementing Ω
means outputting, at every process, the id of a process so that eventually, the id of the same correct
process is output permanently at all correct processes.

Algorithm 7 T_{EIC→EC}: transformation from EIC to EC for process p_i
Internal variables:
    count_i: integer (initially 0) that stores the number of the last instance of consensus invoked by p_i

Functions:
    DecideEC(ℓ, v) returns the value v as a response to proposeEC_ℓ

On invocation of proposeEC_ℓ(v)
    count_i := ℓ
    proposeEIC_ℓ(v)
On reception of v as response of proposeEIC_ℓ
    If count_i = ℓ then
        DecideEC(ℓ, v)


B.1 Overview of the reduction algorithm

The basic idea underlying T_{𝒟→Ω} is to have each process locally simulate the overall distributed
system in which the processes execute several runs of A that could have happened in the current
failure pattern and failure detector history. Every process then uses these runs to extract Ω.

In the local simulations, every process p feeds algorithm A with a set of proposed values, one
for each process of the system. Then all automata composing A are triggered locally by p, which
emulates, for every simulated run of A, the states of all processes as well as the emulated buffer of
exchanged messages.

Crucial elements that are needed for the simulation are (1) the values from failure detectors
that would be output by 𝒟 as well as (2) the order according to which the processes are taking
steps. For these elements, which we call the stimuli of algorithm A, process p periodically queries
its failure detector module and exchanges the failure detector information with the other processes.

The reduction algorithm T_{𝒟→Ω} consists of two tasks that are run in parallel at every process:
the communication task and the computation task. In the communication task, every process
maintains ever-growing stimuli of algorithm A by periodically querying its failure detector module
and sending the output to all other processes. In the computation task, every process periodically
feeds the stimuli to algorithm A, simulates several runs of A, and computes the current emulated
output of Ω.

B.2 Building a DAG 

The communication task of algorithm $T_{\mathcal{D} \to \Omega}$ is presented in Figure 1. Executing this task, $p$ learns
more and more of the processes' failure detector outputs and temporal relations between them. All
this information is pieced together in a single data structure, a directed acyclic graph (DAG) $G_p$.
Informally, every vertex $[q, d, k]$ of $G_p$ is a failure detector value "seen" by $q$ in its $k$-th query of its
failure detector module. An edge $([q,d,k], [q',d',k'])$ can be interpreted as "$q$ saw failure detector
value $d$ (in its $k$-th query) before $q'$ saw failure detector value $d'$ (in its $k'$-th query)".

DAG $G_p$ has some special properties which follow from its construction. Let $F$ be the current
failure pattern in $\mathcal{E}$ and $H$ be the current failure detector history in $\mathcal{D}(F)$. Then:

(1) The vertices of $G_p$ are of the form $[q,d,k]$ where $q \in \Pi$, $d \in \mathcal{R}_{\mathcal{D}}$ and $k \in \mathbb{N}$. There is a
mapping $\tau$: vertices of $G_p \mapsto \mathbb{T}$, associating a time with every vertex of $G_p$, such that:

(a) For any vertex $v = [q,d,k]$, $q \notin F(\tau(v))$ and $d = H(q,\tau(v))$. That is, $d$ is the value
output by $q$'s failure detector module at time $\tau(v)$.
$G_p \leftarrow$ empty graph
$k_p \leftarrow 0$

while true do
    receive message $m$
    $d_p \leftarrow$ query failure detector $\mathcal{D}$
    $k_p \leftarrow k_p + 1$
    if $m$ is of the form $(q, G_q, p)$ then $G_p \leftarrow G_p \cup G_q$
    add $[p, d_p, k_p]$ and edges from all vertices of $G_p$ to $[p, d_p, k_p]$ to $G_p$
    send $(p, G_p, q)$ to all $q \in \Pi$


Figure 1: Building a DAG: process p 


(b) For any edge $(v,v')$ in $G_p$, $\tau(v) < \tau(v')$. That is, any edge in $G_p$ reflects the temporal
order in which the failure detector values are output.

(2) If $v = [q, d, k]$ and $v' = [q, d', k']$ are vertices of $G_p$, and $k < k'$, then $(v, v')$ is an edge of $G_p$.

(3) $G_p$ is transitively closed: if $(v,v')$ and $(v',v'')$ are edges of $G_p$, then $(v,v'')$ is also an edge of
$G_p$.

(4) For all correct processes $p$ and $q$ and all times $t$, there is a time $t' \geq t$, a $d \in \mathcal{R}_{\mathcal{D}}$ and a $k \in \mathbb{N}$
such that for every vertex $v$ of $G_p(t)$, $(v, [q, d, k])$ is an edge of $G_p(t')$.³

Note that properties (1)-(4) imply that, for every correct process $p$, $t \in \mathbb{T}$ and $k \in \mathbb{N}$, there is
a time $t'$ such that $G_p(t')$ contains a path $g = [q_1, d_1, k_1] \to [q_2, d_2, k_2] \to \ldots$, such that (a) every
correct process appears at least $k$ times in $g$, and (b) for any path $g'$ in $G_p(t)$, $g' \cdot g$ is also a path
in $G_p(t')$.
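
The loop of Figure 1 can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the paper's code: the class `SampleDag`, the method `step`, and the helper `is_transitively_closed` are hypothetical names, message passing is replaced by handing over another process's DAG object directly, and failure detector values are plain strings.

```python
from typing import Hashable, Optional, Set, Tuple

Vertex = Tuple[str, Hashable, int]          # [q, d, k]
Edge = Tuple[Vertex, Vertex]


class SampleDag:
    """Local DAG G_p of failure detector samples kept by one process."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.k = 0                           # number of queries made so far (k_p)
        self.vertices: Set[Vertex] = set()
        self.edges: Set[Edge] = set()

    def step(self, fd_value: Hashable,
             received: Optional["SampleDag"] = None) -> Vertex:
        """One loop iteration: count the query, merge a received DAG, add a sample."""
        self.k += 1
        if received is not None:             # G_p <- G_p U G_q
            self.vertices |= received.vertices
            self.edges |= received.edges
        new_vertex = (self.owner, fd_value, self.k)
        # Edges from all vertices currently in G_p to the new vertex.
        self.edges |= {(old, new_vertex) for old in self.vertices}
        self.vertices.add(new_vertex)
        return new_vertex


def is_transitively_closed(edges: Set[Edge]) -> bool:
    """Check property (3) on a finite edge set."""
    return all((u, w) in edges
               for (u, v) in edges
               for (v2, w) in edges
               if v == v2)


q = SampleDag("q")
p = SampleDag("p")
v1 = q.step("d1")                 # q samples its failure detector
v2 = p.step("d2", received=q)     # p receives q's DAG, then samples
v3 = p.step("d3")                 # p samples again
```

In this small run, `p`'s DAG contains the edges `v1 -> v2`, `v1 -> v3` and `v2 -> v3`, so the two samples of `p` are ordered as property (2) requires and the edge set is transitively closed.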

B.3 Simulation trees 

Now DAG $G_p$ can be used to simulate runs of $A$. Any path $g = [q_1,d_1,k_1], [q_2,d_2,k_2], \ldots, [q_s,d_s,k_s]$
through $G_p$ gives the order in which processes $q_1, q_2, \ldots, q_s$ "see", respectively, failure detector
values $d_1, d_2, \ldots, d_s$. That is, $g$ contains an activation schedule and failure detector outputs for
the processes to execute steps of $A$'s instances. Let $I$ be any initial configuration of $A$. Consider
a schedule $S$ that is applicable to $I$ and compatible with $g$, i.e., $|S| = s$ and $\forall k \in \{1,2,\ldots,s\}$,
$S[k] = (q_k, m_k, d_k)$, where $m_k$ is a message addressed to $q_k$ (or the null message $\lambda$).

All schedules that are applicable to $I$ and compatible with paths in $G_p$ can be represented as a
tree $\Upsilon_p$, called the simulation tree induced by $G_p$ and $I$. The set of vertices of $\Upsilon_p$ is the set of all
schedules $S$ that are applicable to $I$ and compatible with paths in $G_p$. The root of $\Upsilon_p$ is the empty
schedule $S_\perp$. There is an edge from $S$ to $S'$ if and only if $S' = S \cdot e$ for a step $e$; the edge is labeled
$e$. Thus, every vertex $S$ of $\Upsilon_p$ is associated with a sequence of steps $e_1\, e_2 \ldots e_s$ consisting of labels
of the edges on the path from $S_\perp$ to $S$. In addition, every descendant of $S$ in $\Upsilon_p$ corresponds to
an extension of $e_1\, e_2 \ldots e_s$.

The construction of $\Upsilon_p$ implies that, for any vertex $S$ of $\Upsilon_p$, there exists a partial run
$\langle F, H, I, S, T \rangle$ of $A$ where $F$ is the current failure pattern and $H \in \mathcal{D}(F)$ is the current failure
detector history. Thus, if in $S$, correct processes appear sufficiently often and receive sufficiently
many messages sent to them, then every correct (in $F$) process decides in $S(I)$.

3 For any variable x and time t, x(t) denotes the value of x at time t. 
Figure 2: A DAG and a tree. Panel (a) shows the DAG; panel (b) shows a portion of the simulation tree it induces.


In the example depicted in Figure 2, a DAG (a) induces a simulation tree, a portion of which is
shown in (b). There are three non-trivial paths in the DAG: $[p_1,d_1,k_1] \to [p_2,d_2,k_2] \to [p_1,d_3,k_3]$,
$[p_2,d_2,k_2] \to [p_1,d_3,k_3]$ and $[p_1,d_1,k_1] \to [p_1,d_3,k_3]$. Every path through
the DAG and an initial configuration $I$ induce at least one schedule in the simulation tree.
Hence, the simulation tree has at least three leaves: $(p_1,\lambda,d_1)\,(p_2,m_2,d_2)\,(p_1,m_3,d_3)$, $(p_2,\lambda,d_2)\,(p_1,m_3',d_3)$,
and $(p_1,\lambda,d_3)$. Recall that $\lambda$ is the empty message: since the message buffer is empty
in $I$, no non-empty message can be received in the first step of any schedule.

B.4 Tags and valences 

Let $I^i$, $i \in \{0,1,\ldots,n\}$, denote the initial configuration of $A$ in which processes $p_1,\ldots,p_i$ propose 1
and the rest (processes $p_{i+1},\ldots,p_n$) propose 0. In the computation task of the reduction algorithm,
every process $p$ maintains an ever-growing simulation forest $\Upsilon_p = \{\Upsilon_p^0, \Upsilon_p^1, \ldots, \Upsilon_p^n\}$ where $\Upsilon_p^i$
($0 \leq i \leq n$) denotes the simulation tree induced by $G_p$ and initial configuration $I^i$.

For every vertex of the simulation forest, $p$ assigns a set of tags. Vertex $S$ of tree $\Upsilon_p^i$ is assigned
a tag $v$ if and only if $S$ has a descendant $S'$ in $\Upsilon_p^i$ such that $p$ decides $v$ in $S'(I^i)$. We call the set
of tags the valence of the vertex. By definition, if $S$ has a descendant with a tag $v$, then $S$ has tag $v$.
Validity of consensus ensures that the set of tags is a subset of $\{0,1\}$.

Of course, at a given time, some vertices of the simulation forest $\Upsilon_p$ might not have any tags
because the simulation stimuli are not sufficiently long yet. But this is just a matter of time: if
$p$ is correct, then every vertex of $p$'s simulation forest will eventually have an extension in which
correct processes appear sufficiently often for $p$ to take a decision.

A vertex $S$ of $\Upsilon_p^i$ is 0-valent if its set of tags is exactly $\{0\}$ (only 0 can be decided in $S$'s extensions
in $\Upsilon_p^i$). A 1-valent vertex is analogously defined. If a vertex $S$ has both tags 0 and 1 (both 0 and
1 can be decided in $S$'s extensions), then we say that $S$ is bivalent.⁴

It immediately follows from Validity of consensus that the root of $\Upsilon_p^0$ can at most be 0-valent,
and the root of $\Upsilon_p^n$ can at most be 1-valent (the roots of $\Upsilon_p^0$ and $\Upsilon_p^n$ cannot be bivalent).
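
The tagging rule can be illustrated with a short Python sketch. It assumes a hypothetical finite tree given as a `children` map and a `decided` map recording the value that $p$ decides at a vertex, and it treats a vertex as one of its own descendants; the function names `tag_tree` and `classify` are not from the paper.

```python
from typing import Dict, FrozenSet, Hashable, List, Set


def tag_tree(children: Dict[Hashable, List[Hashable]],
             decided: Dict[Hashable, int],
             root: Hashable) -> Dict[Hashable, FrozenSet[int]]:
    """Give each vertex the set of values decided at it or at any of its descendants."""
    tags: Dict[Hashable, FrozenSet[int]] = {}

    def visit(vertex: Hashable) -> Set[int]:
        found: Set[int] = set()
        if vertex in decided:
            found.add(decided[vertex])
        for child in children.get(vertex, []):
            found |= visit(child)            # tags propagate from descendants upward
        tags[vertex] = frozenset(found)
        return found

    visit(root)
    return tags


def classify(tag_set: FrozenSet[int]) -> str:
    """Name the valence of a vertex from its current set of tags."""
    if tag_set == {0}:
        return "0-valent"
    if tag_set == {1}:
        return "1-valent"
    if tag_set == {0, 1}:
        return "bivalent"
    return "untagged"


children = {"root": ["a", "b", "c"], "a": ["a0"], "b": ["b1"]}
decided = {"a0": 0, "b1": 1}
tags = tag_tree(children, decided, "root")
```

Here the root is bivalent, `a` is 0-valent, `b` is 1-valent, and `c` is still untagged because no decision has been simulated below it yet.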

4 The notion of valence was first defined in cu as the set of values that are decided in all extensions of a given
execution. Here we define the valence as only a subset of these values, defined by the simulation tree.
B.5 Stabilization


Note that the simulation trees can only grow with time. As a result, once a vertex of the simulation
forest $\Upsilon_p$ gets a tag $v$, it cannot lose it later. Thus, eventually every vertex of $\Upsilon_p$ stabilizes
as 0-valent, 1-valent, or bivalent. Since correct processes keep continuously exchanging the
failure detector samples and updating their simulation forests, every simulation tree computed by
a correct process at any given time will eventually be a subtree of the simulation forest of every
correct process.

Formally, let $p$ be any correct process, $t$ be any time, $i$ be any index in $\{0,1,\ldots,n\}$, and $S$ be
any vertex of $\Upsilon_p^i(t)$. Then:

(i) There exists a non-empty $V \subseteq \{0,1\}$ such that there is a time after which the valence of $S$ is
$V$. (We say that the valence of $S$ stabilizes on $V$ at $p$.)

(ii) If the valence of $S$ stabilizes on $V$ at $p$, then for every correct process $q$, there is a time after
which $S$ is a vertex of $\Upsilon_q^i$ and the valence of $S$ stabilizes on $V$ at $q$.

Hence, the correct processes eventually agree on the same tagged simulation subtrees. In
discussing the stabilized tagged simulation forest, it is thus convenient to consider the limit
infinite DAG $G$ and the limit infinite simulation forest $\Upsilon = \{\Upsilon^0, \Upsilon^1, \ldots, \Upsilon^n\}$ such that for all
$i \in \{0,1,\ldots,n\}$ and all correct processes $p$, $\bigcup_{t \in \mathbb{T}} G_p(t) = G$ and $\bigcup_{t \in \mathbb{T}} \Upsilon_p^i(t) = \Upsilon^i$.

B.6 Critical index 

Let $p$ be any correct process. We say that index $i \in \{1,2,\ldots,n\}$ is critical if either the root of $\Upsilon^i$
is bivalent or the root of $\Upsilon^{i-1}$ is 0-valent and the root of $\Upsilon^i$ is 1-valent. In the first case, we say
that $i$ is bivalent critical. In the second case, we say that $i$ is univalent critical.

Lemma 6. There is at least one critical index in {1,2,..., n}. 

Proof. Indeed, by the Validity property of consensus, the root of $\Upsilon^0$ is 0-valent, and the root of $\Upsilon^n$
is 1-valent. Thus, there must be an index $i \in \{1,2,\ldots,n\}$ such that the root of $\Upsilon^{i-1}$ is 0-valent,
and the root of $\Upsilon^i$ is either 1-valent or bivalent. □
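
The search for the smallest critical index can be sketched in Python as follows. The sketch assumes the stabilized root valences are already known and are passed as a list `root_valences`, where entry `i` is the valence of the root of the $i$-th limit tree; the function name and this input format are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def smallest_critical_index(root_valences: List[Set[int]]) -> Optional[Tuple[int, str]]:
    """Return (i, kind) for the smallest critical index i in {1, ..., n}, or None."""
    for i in range(1, len(root_valences)):
        previous, current = root_valences[i - 1], root_valences[i]
        if current == {0, 1}:
            return i, "bivalent"             # root of tree i is bivalent
        if previous == {0} and current == {1}:
            return i, "univalent"            # 0-valent root followed by 1-valent root
    return None


univalent_case = smallest_critical_index([{0}, {0}, {1}, {1}])
bivalent_case = smallest_critical_index([{0}, {0, 1}, {1}])
mixed_case = smallest_critical_index([{0}, {0}, {0, 1}, {1}])
```

When the first entry is `{0}`, the last is `{1}`, and every entry is non-empty, the scan cannot return `None`: the first entry that differs from `{0}` is either `{1}` or `{0, 1}`, which is the argument of Lemma 6.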

Since tagged simulation forests computed at the correct processes tend to the same infinite tagged 
simulation forest, eventually, all correct processes compute the same smallest critical index i of 
the same type (univalent or bivalent). Now we have two cases to consider for the smallest critical 
index: (1) i is univalent critical, or (2) i is bivalent critical. 

(1) Handling univalent critical index 

Lemma 7. If $i$ is univalent critical, then $p_i$ is correct.

Proof. By contradiction, assume that $p_i$ is faulty. Then $G$ contains an infinite path $g$ in which
$p_i$ does not participate and every correct process participates infinitely often. Then $\Upsilon^i$ contains a
vertex $S$ such that $p_i$ does not take steps in $S$ and some correct process $p$ decides in $S(I^i)$. Since the
root of $\Upsilon^i$ is 1-valent, $p$ decides 1 in $S(I^i)$. But $p_i$ is the only process that has different states in $I^{i-1}$ and $I^i$,
and $p_i$ does not take part in $S$. Thus, $S$ is also a vertex of $\Upsilon^{i-1}$ and $p$ decides 1 in $S(I^{i-1})$. But
the root of $\Upsilon^{i-1}$ is 0-valent, a contradiction. □
(2) Handling bivalent critical index


Assume now that the root of $\Upsilon^i$ is bivalent. Below we show that $\Upsilon^i$ then contains a decision gadget,
i.e., a finite subtree which is either a fork or a hook (Figure 3).


Figure 3: A fork and a hook. Panel (a) shows a fork; panel (b) shows a hook.

A fork (case (a) in Figure 3) consists of a bivalent vertex $S$ from which two different steps by
the same process $q$, consuming the same message $m$, are possible which lead, on the one hand, to
a 0-valent vertex $S_0$ and, on the other hand, to a 1-valent vertex $S_1$.

A hook (case (b) in Figure 3) consists of a bivalent vertex $S$, a vertex $S'$ which is reached by
executing a step of some process $q$, and two vertices $S_0$ and $S_1$ reached by applying the same step
of process $q'$ to, respectively, $S$ and $S'$. Additionally, $S_0$ must be 0-valent and $S_1$ must be 1-valent
(or vice versa; the order does not matter here).

In both cases, we say that $q$ is the deciding process, and $S$ is the pivot of the decision gadget.

Lemma 8. The deciding process of a decision gadget is correct. 

Proof. Consider any decision gadget $\gamma$ defined by a pivot $S$, vertices $S_0$ and $S_1$ of opposite valence
and a deciding process $q$. By contradiction, assume that $q$ is faulty. Let $g$, $g_0$ and $g_1$ be the
simulation stimuli of, respectively, $S$, $S_0$ and $S_1$. Then $G$ contains an infinite path $\tilde{g}$ such that
(a) $g \cdot \tilde{g}$, $g_0 \cdot \tilde{g}$, $g_1 \cdot \tilde{g}$ are paths in $G$, and (b) $q$ does not appear and the correct processes appear
infinitely often in $\tilde{g}$.

Let $\gamma$ be a fork (case (a) in Figure 3). Then there is a finite schedule $\tilde{S}$ compatible with a
prefix of $\tilde{g}$ and applicable to $S(I^i)$ such that some correct process $p$ decides in $S \cdot \tilde{S}(I^i)$; without
loss of generality, assume that $p$ decides 0. Since $q$ is the only process that can distinguish $S(I^i)$
and $S_1(I^i)$, and $q$ does not appear in $\tilde{S}$, $\tilde{S}$ is also applicable to $S_1(I^i)$. Since $g_1 \cdot \tilde{g}$ is a path of $G$
and $\tilde{S}$ is compatible with a prefix of $\tilde{g}$, it follows that $S_1 \cdot \tilde{S}$ is a vertex of $\Upsilon^i$. Hence, $p$ also decides
0 in $S_1 \cdot \tilde{S}(I^i)$. But $S_1$ is 1-valent, a contradiction.

Let $\gamma$ be a hook (case (b) in Figure 3). Then there is a finite schedule $\tilde{S}$ compatible with a
prefix of $\tilde{g}$ and applicable to $S_0(I^i)$ such that some correct process $p$ decides in $S_0 \cdot \tilde{S}(I^i)$. Without
loss of generality, assume that $S_0$ is 0-valent, and hence $p$ decides 0 in $S_0 \cdot \tilde{S}(I^i)$. Since $q$ is the only
process that can distinguish $S_0(I^i)$ and $S_1(I^i)$, and $q$ does not appear in $\tilde{S}$, $\tilde{S}$ is also applicable to
$S_1(I^i)$. Since $g_1 \cdot \tilde{g}$ is a path of $G$ and $\tilde{S}$ is compatible with a prefix of $\tilde{g}$, it follows that $S_1 \cdot \tilde{S}$ is a
vertex of $\Upsilon^i$. Hence, $p$ also decides 0 in $S_1 \cdot \tilde{S}(I^i)$. But $S_1$ is 1-valent, a contradiction. □

Now we need to show that any bivalent simulation tree $\Upsilon^i$ contains at least one decision gadget $\gamma$.

Lemma 9. If $i$ is bivalent critical, then $\Upsilon^i$ contains a decision gadget.

Proof. Let $i$ be a bivalent critical index. In Figure 4, we present a procedure which goes through
$\Upsilon^i$. The algorithm starts from the bivalent root of $\Upsilon^i$ and terminates when a hook or a fork has
been found.


$S \leftarrow S_\perp$

while true do
    $p \leftarrow$ (choose the next correct process in a round robin fashion)
    $m \leftarrow$ (choose the oldest undelivered message addressed to $p$ in $S(I^i)$)
    if ($S$ has a descendant $S'$ in $\Upsilon^i$ (possibly $S = S'$) such that, for some $d$,
        $S' \cdot (p, m, d)$ is a bivalent vertex of $\Upsilon^i$)
    then $S \leftarrow S' \cdot (p, m, d)$
    else exit


Figure 4: Locating a decision gadget 

We show that the algorithm indeed terminates. Suppose not. Then the algorithm locates an
infinite fair path through the simulation tree, i.e., a path in which all correct processes get scheduled
infinitely often and every message sent to a correct process is eventually consumed. Additionally,
this fair path goes through bivalent states only. But no correct process can decide in a bivalent
state $S(I^i)$ (otherwise we would violate the Agreement property of consensus). As a result, we
constructed a run of $A$ in which no correct process ever decides, a contradiction.

Thus, the algorithm in Figure 4 terminates. That is, there exist a bivalent vertex $S$, a correct
process $p$, and a message $m$ addressed to $p$ in $S(I^i)$ such that

(*) For all descendants $S'$ of $S$ (including $S' = S$) and all $d$, $S' \cdot (p, m, d)$ is not a bivalent vertex
of $\Upsilon^i$.

In other words, any step of $p$ consuming message $m$ brings any descendant of $S$ (including $S$
itself) to either a 1-valent or a 0-valent state. Without loss of generality, assume that, for some $d$,
$S \cdot (p,m,d)$ is a 0-valent vertex of $\Upsilon^i$. Since $S$ is bivalent, it must have a 1-valent descendant $S''$.

If $S''$ includes a step in which $p$ consumes $m$, then we define $S'$ as the vertex of $\Upsilon^i$ such that,
for some $d'$, $S' \cdot (p, m, d')$ is a prefix of $S''$. If $S''$ includes no step in which $p$ consumes $m$, then we
define $S' = S''$. Since $p$ is correct, for some $d'$, $S' \cdot (p,m,d')$ is a vertex of $\Upsilon^i$. In both cases, we
obtain $S'$ such that for some $d'$, $S' \cdot (p, m, d')$ is a 1-valent vertex of $\Upsilon^i$.

Let the path from $S$ to $S'$ go through the vertices $\sigma_0 = S, \sigma_1, \ldots, \sigma_m = S'$. By transitivity
of $G$, for all $k \in \{0,1,\ldots,m\}$, $\sigma_k \cdot (p, m, d')$ is a vertex of $\Upsilon^i$. By (*), $\sigma_k \cdot (p, m, d')$ is either a 0-valent
or a 1-valent vertex of $\Upsilon^i$.

Let $k \in \{0,\ldots,m\}$ be the lowest index such that $(p, m, d')$ brings $\sigma_k$ to a 1-valent state. We
know that such an index exists, since $\sigma_m \cdot (p, m, d')$ is 1-valent and all such resulting states are
either 0-valent or 1-valent.

Now we have the following two cases to consider: (1) k = 0, and (2) k > 0. 
Assume that $k = 0$, i.e., $(p,m,d')$ applied to $S$ brings it to a 1-valent state. But we know that
there is a step $(p,m,d)$ that brings $S$ to a 0-valent state (Case 1 in Figure 5). That is, a fork is
located!

If $k > 0$, we have the following situation. Step $(p,m,d')$ brings $\sigma_{k-1}$ to a 0-valent state, and
$\sigma_k = \sigma_{k-1} \cdot (p',m',d'')$ to a 1-valent state (Case 2 in Figure 5). But that is a hook!

As a result, any bivalent infinite simulation tree has at least one decision gadget. □ 
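
The final case split of the proof can be sketched in Python as follows. The sketch assumes the valences of the vertices $\sigma_k \cdot (p, m, d')$ along the path are already known and given as a list of 0s and 1s ending in 1, and that $S \cdot (p, m, d)$ is 0-valent as in the proof; the function name `locate_gadget` and the returned pair (gadget kind, index of the pivot on the path) are hypothetical conventions.

```python
from typing import List, Tuple


def locate_gadget(valence_after_step: List[int]) -> Tuple[str, int]:
    """Classify the gadget from the valences of sigma_k . (p, m, d'), k = 0..m.

    Returns the kind of gadget and the index of its pivot on the path.
    """
    if not valence_after_step or valence_after_step[-1] != 1:
        raise ValueError("the step must lead the last vertex to a 1-valent state")
    k = valence_after_step.index(1)          # lowest index reaching a 1-valent state
    if k == 0:
        return "fork", 0                     # pivot is sigma_0 = S
    return "hook", k - 1                     # pivot is sigma_{k-1}


fork_case = locate_gadget([1, 0, 1])
hook_case = locate_gadget([0, 0, 1])
```

In the hook case the pivot is $\sigma_{k-1}$: the step $(p, m, d')$ leads it to a 0-valent state, while the same step applied to its child $\sigma_k$ leads to a 1-valent state.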

B.7 The reduction algorithm 

Now we are ready to complete the description of $T_{\mathcal{D} \to \Omega}$. In the computation task (Figure 6), every
process $p$ periodically extracts the current leader from its simulation forest, so that eventually
the correct processes agree on the same correct leader. The current leader is stored in variable
$\Omega$-output$_p$.

Initially, $p$ elects itself as a leader. Periodically, $p$ updates its simulation forest $\Upsilon_p$ by incorporating
more simulation stimuli from $G_p$. If the forest has a univalent critical index $i$, then $p$ outputs
$p_i$ as the current leader estimate. If the forest has a bivalent critical index $i$ and $\Upsilon_p^i$ contains a
decision gadget, then $p$ outputs the deciding process of the smallest decision gadget in $\Upsilon_p^i$ (the
"smallest" can be well-defined, since the vertices of the simulation tree are countable).

Eventually, the correct processes locate the same stable critical index i. Now we have two cases 
to consider: 

(i) $i$ is univalent critical. By Lemma 7, $p_i$ is correct.

(ii) $i$ is bivalent critical. By Lemma 9, the limit simulation tree $\Upsilon^i$ contains a decision gadget.
Eventually, the correct processes locate the same decision gadget $\gamma$ in $\Upsilon^i$ and compute the
deciding process $q$ of $\gamma$. By Lemma 8, $q$ is correct.
Initially:
    for $i = 0, 1, \ldots, n$: $\Upsilon_p^i \leftarrow$ empty graph
    $\Omega$-output$_p \leftarrow p$

while true do
    { Build and tag the simulation forest induced by $G_p$ }
    for $i = 0, 1, \ldots, n$ do
        $\Upsilon_p^i \leftarrow$ simulation tree induced by $G_p$ and $I^i$
        for every vertex $S$ of $\Upsilon_p^i$:
            if $S$ has a descendant $S'$ such that $p$ decides $v$ in $S'(I^i)$ then
                add tag $v$ to $S$

    { Select a process from the tagged simulation forest }
    if there is a critical index then
        $i \leftarrow$ the smallest critical index
        if $i$ is univalent critical then $\Omega$-output$_p \leftarrow p_i$
        if $\Upsilon_p^i$ has a decision gadget then
            $\Omega$-output$_p \leftarrow$ the deciding process of the smallest decision gadget in $\Upsilon_p^i$

Figure 6: Extracting a correct leader: code for each process p 
Thus, eventually, the correct processes elect the same correct leader: $\Omega$ is emulated!
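
The selection step of Figure 6 can be sketched in Python as follows. This is a simplified illustration under assumptions: the root valences of the current forest are given as a list (an empty set meaning "not tagged yet"), `processes[i - 1]` stands for $p_i$, and `find_decider` is a hypothetical callback that returns the deciding process of the smallest decision gadget found so far in tree $i$, or `None` if no gadget has been found.

```python
from typing import Callable, List, Optional, Set


def update_leader(current: str,
                  processes: List[str],
                  root_valences: List[Set[int]],
                  find_decider: Callable[[int], Optional[str]]) -> str:
    """Return the new leader estimate, keeping the current one if nothing applies."""
    for i in range(1, len(root_valences)):
        previous, tree = root_valences[i - 1], root_valences[i]
        if tree == {0, 1}:                           # smallest critical index, bivalent
            decider = find_decider(i)
            return decider if decider is not None else current
        if previous == {0} and tree == {1}:          # smallest critical index, univalent
            return processes[i - 1]                  # output p_i
    return current                                   # no critical index yet


procs = ["p1", "p2", "p3"]
leader_univalent = update_leader("p3", procs, [{0}, {0}, {1}, {1}], lambda i: None)
leader_gadget = update_leader("p1", procs, [{0}, {0, 1}, set(), {1}],
                              lambda i: "p3" if i == 1 else None)
leader_pending = update_leader("p1", procs, [{0}, {0, 1}, set(), {1}], lambda i: None)
leader_untagged = update_leader("p2", procs, [set(), set(), set(), set()], lambda i: None)
```

Keeping the previous estimate when no gadget or critical index is available yet matches the figure, where the output variable is only overwritten inside the two `if` branches.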